This study aims to explore, clean, and prepare a dataset to effectively find correlations among the variables and classify individuals into three categories: Diabetic, Non-Diabetic, and Prediabetic. The motivation behind this project is to gain experience in data analysis and machine learning techniques while also providing a reference for others interested in similar topics.
The data we have chosen for this project was extracted from https://hbiostat.org/data/, provided by Dr. John Schorling from the Department of Medicine, School of Medicine at the University of Virginia. This data comes from a study aimed at understanding the prevalence of obesity and diabetes among African Americans, as well as other cardiovascular risk factors. More information about the data collection process can be found in the article: Willems JP, Saunders JT, DE Hunt, JB Schorling: “Prevalence of coronary heart disease risk factors among rural blacks: A community-based study.” Southern Medical Journal 90:814-820; 1997.
This project is divided into 4 main parts:
Raw Data Exploration and Cleaning: this section covers the typical first approach to the data. We will inspect the dimensions of the data frame and explore which variables are included and what values they take. Where necessary, the data will be reformatted, variables that do not provide useful information for the analysis will be eliminated, and NA values will be handled appropriately.
Correlation Exploration: in this section we will look for relationships among the variables in our dataset. This requires choosing an appropriate statistical test.
Principal Component Analysis: a principal component analysis (PCA) will be performed to check whether the dimensionality of our data can be reduced without losing too much information.
SVM, k-NN and Decision Tree: lastly, after the dimensionality has been reduced, we will train three machine learning models and test how well they predict the response variable.
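Before any modeling, the three-class response (Diabetic, Non-Diabetic, Prediabetic) has to be derived from a measured variable. A common choice for this dataset is glycated hemoglobin (glyhb); the sketch below uses the standard ADA HbA1c cutoffs (below 5.7 non-diabetic, 5.7 to 6.4 prediabetic, 6.5 and above diabetic) as an assumed labelling rule, since the exact thresholds used later in the project are not shown here:

```r
# Sketch (assumed labelling rule, not the project's final code):
# derive the three-class response from glycated hemoglobin using
# the standard ADA HbA1c cutoffs.
classify_glyhb <- function(glyhb) {
  cut(glyhb,
      breaks = c(-Inf, 5.7, 6.5, Inf),
      labels = c("Non-Diabetic", "Prediabetic", "Diabetic"),
      right = FALSE)  # intervals are [lower, upper)
}

classify_glyhb(c(4.8, 6.0, 7.7))
# -> Non-Diabetic Prediabetic Diabetic
```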
We will first load our data into our workspace.
library(readr)
diabetes <- read_csv("/Users/inesmora/Downloads/diabetes.csv")
## Rows: 403 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): location, gender, frame
## dbl (16): id, chol, stab.glu, hdl, ratio, glyhb, age, height, weight, bp.1s,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Once the data frame is loaded, we can first use the function
head() to see the first rows and get an idea of how the data
is structured.
head(diabetes)
## # A tibble: 6 × 19
## id chol stab.glu hdl ratio glyhb location age gender height weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1000 203 82 56 3.60 4.31 Buckingham 46 female 62 121
## 2 1001 165 97 24 6.90 4.44 Buckingham 29 female 64 218
## 3 1002 228 92 37 6.20 4.64 Buckingham 58 female 61 256
## 4 1003 78 93 12 6.5 4.63 Buckingham 67 male 67 119
## 5 1005 249 90 28 8.90 7.72 Buckingham 64 male 68 183
## 6 1008 248 94 69 3.60 4.81 Buckingham 34 male 71 190
## # ℹ 8 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
## # bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>
Our data contains variables that provide information related to individuals’ health and physical characteristics, such as cholesterol levels, glycated hemoglobin, weight, height, etc. Let’s check the dimensions of our data frame:
dim(diabetes)
## [1] 403 19
Our data frame contains 403 rows (samples or individuals) and 19 columns (variables). Let’s explore whether NA or NULL values are present in our data.
table(is.na(diabetes))
##
## FALSE TRUE
## 7082 575
table(is.null(diabetes))
##
## FALSE
## 1
We have 575 NA values. Note that is.null() tests whether the object as a whole is NULL (a data frame cannot store NULL elements), so the second check simply confirms the data frame exists.
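To see where those 575 NAs concentrate, a quick per-column count is helpful. A small sketch (to be run on the diabetes data frame loaded above):

```r
# Sketch: count missing values per column, highest first,
# to see which variables concentrate the NAs
na_counts <- function(df) {
  sort(colSums(is.na(df)), decreasing = TRUE)
}
# na_counts(diabetes)  # bp.2s and bp.2d should dominate
```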
We can also use the function str() to quickly explore the
type of data contained in each variable:
str(diabetes)
## spc_tbl_ [403 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:403] 1000 1001 1002 1003 1005 ...
## $ chol : num [1:403] 203 165 228 78 249 248 195 227 177 263 ...
## $ stab.glu: num [1:403] 82 97 92 93 90 94 92 75 87 89 ...
## $ hdl : num [1:403] 56 24 37 12 28 69 41 44 49 40 ...
## $ ratio : num [1:403] 3.6 6.9 6.2 6.5 8.9 ...
## $ glyhb : num [1:403] 4.31 4.44 4.64 4.63 7.72 ...
## $ location: chr [1:403] "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
## $ age : num [1:403] 46 29 58 67 64 34 30 37 45 55 ...
## $ gender : chr [1:403] "female" "female" "female" "male" ...
## $ height : num [1:403] 62 64 61 67 68 71 69 59 69 63 ...
## $ weight : num [1:403] 121 218 256 119 183 190 191 170 166 202 ...
## $ frame : chr [1:403] "medium" "large" "large" "large" ...
## $ bp.1s : num [1:403] 118 112 190 110 138 132 161 NA 160 108 ...
## $ bp.1d : num [1:403] 59 68 92 50 80 86 112 NA 80 72 ...
## $ bp.2s : num [1:403] NA NA 185 NA NA NA 161 NA 128 NA ...
## $ bp.2d : num [1:403] NA NA 92 NA NA NA 112 NA 86 NA ...
## $ waist : num [1:403] 29 46 49 33 44 36 46 34 34 45 ...
## $ hip : num [1:403] 38 48 57 38 41 42 49 39 40 50 ...
## $ time.ppn: num [1:403] 720 360 180 480 300 195 720 1020 300 240 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. chol = col_double(),
## .. stab.glu = col_double(),
## .. hdl = col_double(),
## .. ratio = col_double(),
## .. glyhb = col_double(),
## .. location = col_character(),
## .. age = col_double(),
## .. gender = col_character(),
## .. height = col_double(),
## .. weight = col_double(),
## .. frame = col_character(),
## .. bp.1s = col_double(),
## .. bp.1d = col_double(),
## .. bp.2s = col_double(),
## .. bp.2d = col_double(),
## .. waist = col_double(),
## .. hip = col_double(),
## .. time.ppn = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We have 3 variables of type chr; the rest are numeric.
Within the first samples, we can already see a large concentration of
NA values in the variables bp.2s and bp.2d.
With the function summary() we can see a statistical
summary of our data, including the quartiles, the mean, and the number of
NA’s:
summary(diabetes)
## id chol stab.glu hdl
## Min. : 1000 Min. : 78.0 Min. : 48.0 Min. : 12.00
## 1st Qu.: 4792 1st Qu.:179.0 1st Qu.: 81.0 1st Qu.: 38.00
## Median :15766 Median :204.0 Median : 89.0 Median : 46.00
## Mean :15978 Mean :207.8 Mean :106.7 Mean : 50.45
## 3rd Qu.:20336 3rd Qu.:230.0 3rd Qu.:106.0 3rd Qu.: 59.00
## Max. :41756 Max. :443.0 Max. :385.0 Max. :120.00
## NA's :1 NA's :1
## ratio glyhb location age
## Min. : 1.500 Min. : 2.68 Length:403 Min. :19.00
## 1st Qu.: 3.200 1st Qu.: 4.38 Class :character 1st Qu.:34.00
## Median : 4.200 Median : 4.84 Mode :character Median :45.00
## Mean : 4.522 Mean : 5.59 Mean :46.85
## 3rd Qu.: 5.400 3rd Qu.: 5.60 3rd Qu.:60.00
## Max. :19.300 Max. :16.11 Max. :92.00
## NA's :1 NA's :13
## gender height weight frame
## Length:403 Min. :52.00 Min. : 99.0 Length:403
## Class :character 1st Qu.:63.00 1st Qu.:151.0 Class :character
## Mode :character Median :66.00 Median :172.5 Mode :character
## Mean :66.02 Mean :177.6
## 3rd Qu.:69.00 3rd Qu.:200.0
## Max. :76.00 Max. :325.0
## NA's :5 NA's :1
## bp.1s bp.1d bp.2s bp.2d
## Min. : 90.0 Min. : 48.00 Min. :110.0 Min. : 60.00
## 1st Qu.:121.2 1st Qu.: 75.00 1st Qu.:138.0 1st Qu.: 84.00
## Median :136.0 Median : 82.00 Median :149.0 Median : 92.00
## Mean :136.9 Mean : 83.32 Mean :152.4 Mean : 92.52
## 3rd Qu.:146.8 3rd Qu.: 90.00 3rd Qu.:161.0 3rd Qu.:100.00
## Max. :250.0 Max. :124.00 Max. :238.0 Max. :124.00
## NA's :5 NA's :5 NA's :262 NA's :262
## waist hip time.ppn
## Min. :26.0 Min. :30.00 Min. : 5.0
## 1st Qu.:33.0 1st Qu.:39.00 1st Qu.: 90.0
## Median :37.0 Median :42.00 Median : 240.0
## Mean :37.9 Mean :43.04 Mean : 341.2
## 3rd Qu.:41.0 3rd Qu.:46.00 3rd Qu.: 517.5
## Max. :56.0 Max. :64.00 Max. :1560.0
## NA's :2 NA's :2 NA's :3
The variables bp.2s and bp.2d contain most
of the NA values present in our data. Since a very large number of NAs
(more than half of the sample) is concentrated in these two variables,
removing all observations containing NAs would mean losing valuable
information contained in the other variables: it would reduce the
number of samples from 403 to 130.
diabetes_noNA <- na.omit(diabetes)
dim(diabetes_noNA)
## [1] 130 19
If these variables are not relevant for our study, we can decide to
eliminate them and then omit NA values. However, in this case an
explanation was provided by the authors of the data. The article by
Willems JP, Saunders JT, DE Hunt, JB Schorling: Prevalence of
coronary heart disease risk factors among rural blacks: A
community-based study. Southern Medical Journal 90:814-820; 1997
explains that blood pressure was measured approximately 15 minutes after
the examination began (bp.1), and if the pressure was
greater than or equal to 140/90 mmHg, a second measurement was taken
approximately 10 minutes later (bp.2). For these cases, the
blood pressure considered by the researchers is the average of both
measurements.
Now that we understand why there are so many NA values in these fields, we can conveniently reformulate our table and then omit the NA values.
First, we will convert the data to the metric system (optional) and create a couple of new variables:
diabetes$weight <- diabetes$weight * 0.453592 # From pounds to Kg
diabetes$height <- diabetes$height + 100 # This variable equals height - 100, we add the 100 back
diabetes$BMI <- with(diabetes, weight / ((height)/100)^2) # We can create a variable with the BMI for each sample
diabetes$waist <- diabetes$waist * 2.54 # From inches to cm
diabetes$hip <- diabetes$hip * 2.54 # From inches to cm
diabetes$waist_to_hip <- diabetes$waist / diabetes$hip # We can create a variable for the waist-to-hip ratio
diabetes$bp.s <- ifelse(!is.na(diabetes$bp.1s) & !is.na(diabetes$bp.2s),
(diabetes$bp.1s + diabetes$bp.2s) / 2,
ifelse(!is.na(diabetes$bp.1s), diabetes$bp.1s,
diabetes$bp.2s))
diabetes$bp.d <- ifelse(!is.na(diabetes$bp.1d) & !is.na(diabetes$bp.2d),
(diabetes$bp.1d + diabetes$bp.2d) / 2,
ifelse(!is.na(diabetes$bp.1d), diabetes$bp.1d,
diabetes$bp.2d))
columns_to_delete <- c("bp.1s", "bp.2s", "bp.1d", "bp.2d", "time.ppn", "id", "gender", "frame", "location")
# We can now eliminate the columns "bp.1s", "bp.2s", "bp.1d" and "bp.2d", since we have incorporated the variables "bp.s" and "bp.d". We also remove other variables that we won't be using in our analysis
diabetes <- diabetes[, !(names(diabetes) %in% columns_to_delete)]
diabetes <- na.omit(diabetes)
dim(diabetes)
## [1] 377 14
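A quick sanity check on the cleaned data frame is worthwhile before moving on. The helper below (a small sketch, to be run on the diabetes data frame produced above) verifies that no missing values remain and that every remaining column is numeric:

```r
# Sketch: a clean data frame for this analysis should contain
# no missing values and only numeric columns
check_clean <- function(df) {
  !anyNA(df) && all(vapply(df, is.numeric, logical(1)))
}
# check_clean(diabetes)  # should return TRUE after the steps above
```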
Now that our data is clean, we can start doing a more thorough examination of our data.
To begin our analysis, we need to understand how the values of our variables are distributed. This will allow us to make informed decisions on the statistical testing required to find correlations. To assess the distribution of our variables, we will perform the following steps:
Histogram Visualization:
Histograms provide a graphical representation of the distribution of
each variable. They allow us to quickly assess whether the data is
approximately normal, skewed, or has heavy tails. This visual
examination is a preliminary step to identify patterns or irregularities
in the data.
Q-Q Plots (Quantile-Quantile Plots):
Q-Q plots compare the quantiles of our variables against the quantiles
of a standard normal distribution. If the data is normally distributed,
the points will approximately lie on a straight diagonal line.
Deviations from this line indicate departures from normality, such as
skewness or heavy tails.
Shapiro-Wilk Test:
This is a statistical test used to determine if a sample comes from a
normally distributed population. It is particularly effective for small
to moderate-sized datasets. However, it tends to be overly sensitive for
large datasets, often detecting minor deviations from normality that are
not practically significant.
Anderson-Darling Test:
The Anderson-Darling test is an alternative to the Shapiro-Wilk test,
especially when we are interested in assessing the fit of the data to a
normal distribution with particular sensitivity to deviations in
the tails. Unlike the Shapiro-Wilk test, the Anderson-Darling
test performs well on both small and large datasets.
Why Multiple Tests?
Using both visualizations (histograms, Q-Q plots) and statistical tests (Shapiro-Wilk, Anderson-Darling) provides a comprehensive assessment of normality. Visual methods offer an intuitive understanding of the data distribution, while statistical tests provide formal evidence to support or reject normality assumptions.
library(nortest)
check_normality <- function(data) {
# Prepare an empty data frame to store results
results <- data.frame(Variable = character(), Test = character(), Statistic = numeric(), p_value = numeric())
for (col_name in colnames(data)) {
if (is.numeric(data[[col_name]])) {
# Shapiro-Wilk Test
shapiro <- shapiro.test(data[[col_name]])
results <- rbind(results, data.frame(Variable = col_name,
Test = "Shapiro-Wilk",
Statistic = shapiro$statistic,
p_value = shapiro$p.value))
# Anderson-Darling Test
anderson_darling <- ad.test(data[[col_name]])
results <- rbind(results, data.frame(Variable = col_name,
Test = "Anderson-Darling",
Statistic = anderson_darling$statistic,
p_value = anderson_darling$p.value))
# Plot Histogram and QQ plot for each variable
par(mfrow = c(1, 2))
hist(data[[col_name]], main = paste("Histogram of", col_name), col = "skyblue", breaks = 20)
qqnorm(data[[col_name]], main = paste("QQ Plot of", col_name))
qqline(data[[col_name]], col = "red")
}
}
# Reset plot layout to single plot
par(mfrow = c(1, 1))
# Return the results data frame
return(results)
}
normality_results <- check_normality(diabetes)
We have visually explored the distribution of our data using
histograms and Q-Q plots. This initial inspection provides a rough idea
of whether our variables follow a normal distribution. For some
variables, such as bp.d and waist_to_hip, the
Q-Q plots and histograms suggest a reasonable approximation to
normality, but the presence of longer tails warrants closer
examination.
On the other hand, variables like glyhb and
stab.glu clearly deviate from normality, displaying skewed
distributions and notable departures from the Q-Q line.
However, visual inspection alone is not sufficient to conclusively determine normality. Therefore, it is essential to apply formal statistical tests that provide quantitative evidence of normality or lack thereof.
print(normality_results)
## Variable Test Statistic p_value
## W chol Shapiro-Wilk 0.9563425 3.859776e-09
## A chol Anderson-Darling 2.7910413 4.909949e-07
## W1 stab.glu Shapiro-Wilk 0.6529111 5.429144e-27
## A1 stab.glu Anderson-Darling 45.4074719 3.700000e-24
## W2 hdl Shapiro-Wilk 0.9201966 2.885412e-13
## A2 hdl Anderson-Darling 7.7388169 6.630838e-19
## W3 ratio Shapiro-Wilk 0.8648210 1.378889e-17
## A3 ratio Anderson-Darling 6.1704161 3.476246e-15
## W4 glyhb Shapiro-Wilk 0.7226508 1.468786e-24
## A4 glyhb Anderson-Darling 37.0011052 3.700000e-24
## W5 age Shapiro-Wilk 0.9733102 2.044750e-06
## A5 age Anderson-Darling 2.5055371 2.444036e-06
## W6 height Shapiro-Wilk 0.9877327 2.870518e-03
## A6 height Anderson-Darling 1.8212077 1.159347e-04
## W7 weight Shapiro-Wilk 0.9647110 6.839620e-08
## A7 weight Anderson-Darling 3.0563492 1.107998e-07
## W8 waist Shapiro-Wilk 0.9783109 1.940414e-05
## A8 waist Anderson-Darling 2.1253135 2.081800e-05
## W9 hip Shapiro-Wilk 0.9581573 6.986266e-09
## A9 hip Anderson-Darling 4.4449305 4.778402e-11
## W10 BMI Shapiro-Wilk 0.9633143 4.124121e-08
## A10 BMI Anderson-Darling 3.2871773 3.040569e-08
## W11 waist_to_hip Shapiro-Wilk 0.9888348 5.561062e-03
## A11 waist_to_hip Anderson-Darling 0.6480560 9.020888e-02
## W12 bp.s Shapiro-Wilk 0.9371397 1.573833e-11
## A12 bp.s Anderson-Darling 4.3094905 1.014257e-10
## W13 bp.d Shapiro-Wilk 0.9945953 2.069029e-01
## A13 bp.d Anderson-Darling 0.6509197 8.874930e-02
NOTE
bp.d: both tests (p = 0.207 and p = 0.089) suggest this variable is normally distributed. waist_to_hip is borderline: the Shapiro-Wilk test rejects normality (p = 0.006) while the Anderson-Darling test does not (p = 0.090).
Now that we have ruled out normality in most of our data, we have valuable information on how to proceed when looking for correlations among our variables. Since parametric correlation methods like Pearson’s correlation assume normality and linearity, applying them to non-normally distributed data could produce misleading results.
Instead, we will rely on non-parametric correlation methods, which are more robust to violations of normality and can detect monotonic relationships even when they are not linear. The two most commonly used are Spearman’s rank correlation (rho) and Kendall’s tau:
cor_matrices <- list(
Spearman = cor(diabetes, use = "pairwise.complete.obs", method = "spearman"),
Kendall = cor(diabetes, use = "pairwise.complete.obs", method = "kendall")
)
To observe relationships among the variables more comfortably, we will create a heatmap from each correlation matrix.
library(ggplot2)
library(reshape2)
plot_heatmap <- function(cor_matrix, title) {
melted_cor <- melt(cor_matrix)
ggplot(data = melted_cor, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = title, fill = "Correlation")
}
plot_heatmap(cor_matrices$Spearman, "Spearman Correlation Matrix")
plot_heatmap(cor_matrices$Kendall, "Kendall Correlation Matrix")
First, we observe that Kendall’s tau identifies weaker correlations in the data than Spearman’s rho (tau values are typically smaller in magnitude for the same association).
We can observe some variables that are strongly correlated with each
other, either directly or inversely. Some correlations are not
surprising, such as hip with waist, and
weight with waist and hip. We do not
see very high correlations between glyhb and the other variables, except
for stabilized glucose levels (stab.glu). This
makes sense because glycated hemoglobin is an indicator of
long-term glycemic control and reflects the average blood glucose
exposure over a prolonged period, approximately the last 2 to 3 months.
The variable with the next strongest positive correlation is the
cholesterol-to-HDL ratio (ratio). Other
variables are positively related to a lesser degree.
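To put numbers on what the heatmap shows for glyhb, we can extract and rank its correlations with every other variable. A small sketch (assuming the cor_matrices list computed above is in the workspace):

```r
# Sketch: rank every variable's correlation with a target variable,
# e.g. glyhb, directly from a correlation matrix
cor_with <- function(cor_matrix, target) {
  r <- cor_matrix[, target]
  sort(r[names(r) != target], decreasing = TRUE)
}
# cor_with(cor_matrices$Spearman, "glyhb")
# stab.glu and ratio should appear at the top
```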
We can now check which correlations are statistically significant.
get_significance <- function(data, method = "kendall") {
n <- ncol(data)
sig_matrix <- matrix(NA, n, n)
rownames(sig_matrix) <- colnames(data)
colnames(sig_matrix) <- colnames(data)
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
test <- cor.test(data[[i]], data[[j]], method = method)
sig_matrix[i, j] <- test$p.value
sig_matrix[j, i] <- test$p.value
}
}
return(sig_matrix)
}
get_significance(diabetes, method = "spearman")
## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties
## (this warning is repeated for every pair of variables with tied values)
## chol stab.glu hdl ratio glyhb
## chol NA 4.517672e-03 7.610405e-03 7.101661e-16 6.041701e-06
## stab.glu 4.517672e-03 NA 1.126925e-04 4.618470e-07 2.413169e-28
## hdl 7.610405e-03 1.126925e-04 NA 1.389484e-92 8.126834e-05
## ratio 7.101661e-16 4.618470e-07 1.389484e-92 NA 1.130869e-09
## glyhb 6.041701e-06 2.413169e-28 8.126834e-05 1.130869e-09 NA
## age 1.073414e-08 3.373630e-11 3.414139e-01 1.294828e-04 2.991115e-18
## height 2.556725e-02 9.958378e-01 1.092279e-02 2.182212e-01 6.745027e-01
## weight 2.904516e-01 2.541528e-04 1.005076e-10 2.474987e-12 4.508379e-05
## waist 3.957307e-02 1.294796e-06 1.374631e-10 5.872831e-14 1.970517e-10
## hip 5.207256e-02 1.097362e-04 2.767752e-06 2.997598e-09 1.257711e-05
## BMI 9.596863e-02 1.040834e-04 8.292849e-10 1.929830e-12 2.274646e-05
## waist_to_hip 6.436984e-02 6.038970e-04 5.580024e-05 1.199265e-06 7.482895e-07
## bp.s 4.752879e-05 5.244497e-07 8.903504e-01 5.394255e-02 1.331098e-07
## bp.d 2.224756e-04 1.309166e-01 1.752964e-01 5.912074e-01 1.969302e-01
## age height weight waist
## chol 1.073414e-08 2.556725e-02 2.904516e-01 3.957307e-02
## stab.glu 3.373630e-11 9.958378e-01 2.541528e-04 1.294796e-06
## hdl 3.414139e-01 1.092279e-02 1.005076e-10 1.374631e-10
## ratio 1.294828e-04 2.182212e-01 2.474987e-12 5.872831e-14
## glyhb 2.991115e-18 6.745027e-01 4.508379e-05 1.970517e-10
## age NA 1.023866e-01 6.584764e-01 1.062139e-03
## height 1.023866e-01 NA 7.772381e-08 2.597328e-01
## weight 6.584764e-01 7.772381e-08 NA 7.795145e-104
## waist 1.062139e-03 2.597328e-01 7.795145e-104 NA
## hip 5.053197e-01 1.727417e-02 1.443625e-88 1.140091e-96
## BMI 8.925369e-01 1.866302e-01 1.052505e-240 1.678204e-115
## waist_to_hip 1.411603e-07 9.324900e-08 4.226745e-08 9.510770e-28
## bp.s 6.339951e-22 8.921703e-01 7.490632e-03 5.686684e-06
## bp.d 5.243543e-02 2.813468e-01 1.315254e-04 4.140181e-05
## hip BMI waist_to_hip bp.s bp.d
## chol 5.207256e-02 9.596863e-02 6.436984e-02 4.752879e-05 2.224756e-04
## stab.glu 1.097362e-04 1.040834e-04 6.038970e-04 5.244497e-07 1.309166e-01
## hdl 2.767752e-06 8.292849e-10 5.580024e-05 8.903504e-01 1.752964e-01
## ratio 2.997598e-09 1.929830e-12 1.199265e-06 5.394255e-02 5.912074e-01
## glyhb 1.257711e-05 2.274646e-05 7.482895e-07 1.331098e-07 1.969302e-01
## age 5.053197e-01 8.925369e-01 1.411603e-07 6.339951e-22 5.243543e-02
## height 1.727417e-02 1.866302e-01 9.324900e-08 8.921703e-01 2.813468e-01
## weight 1.443625e-88 1.052505e-240 4.226745e-08 7.490632e-03 1.315254e-04
## waist 1.140091e-96 1.678204e-115 9.510770e-28 5.686684e-06 4.140181e-05
## hip NA 4.759370e-120 7.947668e-01 4.513842e-04 5.181553e-05
## BMI 4.759370e-120 NA 5.355218e-06 2.884800e-03 5.310051e-05
## waist_to_hip 7.947668e-01 5.355218e-06 NA 3.477124e-03 2.018276e-01
## bp.s 4.513842e-04 2.884800e-03 3.477124e-03 NA 5.176381e-40
## bp.d 5.181553e-05 5.310051e-05 2.018276e-01 5.176381e-40 NA
When checking correlations with Spearman's method, we are warned about ties: multiple samples share the same value for some variables, which can make Spearman's test slightly biased. When the data contain a substantial number of ties, Kendall's Tau is generally more reliable than Spearman's rank correlation, because its tau-b statistic explicitly corrects for ties, whereas exact p-values for Spearman's test cannot be computed in their presence. Since our data has many ties, we will rely on Kendall's Tau.
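To see why ties matter, here is a minimal sketch on hypothetical data (not drawn from our dataset). With tied values, base R warns that exact p-values cannot be computed and falls back to an approximation; the Kendall estimate reported by `cor.test()` is the tie-corrected tau-b variant:

```r
# Hypothetical data with many ties (illustrative only)
x <- c(1, 2, 2, 3, 3, 3, 4, 5)
y <- c(2, 1, 3, 3, 4, 4, 5, 5)

# Both tests warn "Cannot compute exact p-value with ties";
# we suppress the warnings here to keep the output clean
spearman <- suppressWarnings(cor.test(x, y, method = "spearman"))
kendall  <- suppressWarnings(cor.test(x, y, method = "kendall"))

spearman$estimate  # rho
kendall$estimate   # tau-b, with denominators adjusted for ties
```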
Let’s compute the significance matrix with Kendall’s method:
get_significance(diabetes, method = "kendall")
## chol stab.glu hdl ratio glyhb
## chol NA 6.382198e-03 6.293696e-03 6.529747e-16 7.475404e-06
## stab.glu 6.382198e-03 NA 1.173264e-04 5.035866e-07 7.217143e-28
## hdl 6.293696e-03 1.173264e-04 NA 2.031745e-73 9.132880e-05
## ratio 6.529747e-16 5.035866e-07 2.031745e-73 NA 2.143410e-09
## glyhb 7.475404e-06 7.217143e-28 9.132880e-05 2.143410e-09 NA
## age 1.413286e-08 1.174459e-10 3.112328e-01 1.007381e-04 4.718043e-17
## height 2.455952e-02 9.803541e-01 1.113324e-02 2.146913e-01 6.737615e-01
## weight 2.732637e-01 2.795915e-04 1.505147e-10 1.204130e-11 5.808091e-05
## waist 3.500692e-02 1.541636e-06 2.405686e-10 3.338327e-13 3.113901e-10
## hip 5.330542e-02 1.066166e-04 3.209184e-06 7.294977e-09 1.606475e-05
## BMI 8.616310e-02 1.010760e-04 8.880549e-10 6.356182e-12 3.550117e-05
## waist_to_hip 5.417245e-02 6.040125e-04 4.411779e-05 9.070827e-07 9.448597e-07
## bp.s 4.146121e-05 1.270779e-06 8.692446e-01 5.273371e-02 2.179977e-07
## bp.d 2.478486e-04 1.445828e-01 1.725065e-01 5.967590e-01 1.886785e-01
## age height weight waist hip
## chol 1.413286e-08 2.455952e-02 2.732637e-01 3.500692e-02 5.330542e-02
## stab.glu 1.174459e-10 9.803541e-01 2.795915e-04 1.541636e-06 1.066166e-04
## hdl 3.112328e-01 1.113324e-02 1.505147e-10 2.405686e-10 3.209184e-06
## ratio 1.007381e-04 2.146913e-01 1.204130e-11 3.338327e-13 7.294977e-09
## glyhb 4.718043e-17 6.737615e-01 5.808091e-05 3.113901e-10 1.606475e-05
## age NA 9.871842e-02 6.379594e-01 9.533447e-04 5.141915e-01
## height 9.871842e-02 NA 1.519901e-07 2.816247e-01 1.382471e-02
## weight 6.379594e-01 1.519901e-07 NA 4.634765e-80 1.497740e-71
## waist 9.533447e-04 2.816247e-01 4.634765e-80 NA 2.137535e-76
## hip 5.141915e-01 1.382471e-02 1.497740e-71 2.137535e-76 NA
## BMI 8.693605e-01 2.191128e-01 1.338262e-137 5.334134e-86 4.183931e-90
## waist_to_hip 1.787569e-07 1.339103e-07 4.794731e-08 2.536168e-28 5.386411e-01
## bp.s 6.533636e-21 9.000141e-01 7.007826e-03 6.061786e-06 3.493373e-04
## bp.d 5.143506e-02 2.990254e-01 1.071346e-04 4.023362e-05 4.365406e-05
## BMI waist_to_hip bp.s bp.d
## chol 8.616310e-02 5.417245e-02 4.146121e-05 2.478486e-04
## stab.glu 1.010760e-04 6.040125e-04 1.270779e-06 1.445828e-01
## hdl 8.880549e-10 4.411779e-05 8.692446e-01 1.725065e-01
## ratio 6.356182e-12 9.070827e-07 5.273371e-02 5.967590e-01
## glyhb 3.550117e-05 9.448597e-07 2.179977e-07 1.886785e-01
## age 8.693605e-01 1.787569e-07 6.533636e-21 5.143506e-02
## height 2.191128e-01 1.339103e-07 9.000141e-01 2.990254e-01
## weight 1.338262e-137 4.794731e-08 7.007826e-03 1.071346e-04
## waist 5.334134e-86 2.536168e-28 6.061786e-06 4.023362e-05
## hip 4.183931e-90 5.386411e-01 3.493373e-04 4.365406e-05
## BMI NA 3.983044e-06 2.657579e-03 4.201644e-05
## waist_to_hip 3.983044e-06 NA 4.303401e-03 2.271859e-01
## bp.s 2.657579e-03 4.303401e-03 NA 1.413531e-36
## bp.d 4.201644e-05 2.271859e-01 1.413531e-36 NA
The analysis reveals several biologically relevant associations between the variables studied. Cholesterol levels show significant correlations with glycemic control (glyhb) and body weight, which aligns with established relationships between metabolic health and lipid profiles. Glycated hemoglobin (glyhb) is notably correlated with variables such as waist circumference, hip circumference, BMI, and waist-to-hip ratio, all of which are known indicators of metabolic syndrome and insulin resistance. The robust correlation between glyhb and these anthropometric measures reinforces their importance as risk factors for metabolic disorders.
Blood pressure (both systolic and diastolic) exhibits significant associations with weight, waist circumference, and BMI. This relationship supports the well-documented link between obesity and hypertension. Additionally, the association of blood pressure with glycated hemoglobin suggests potential interactions between cardiovascular health and blood glucose regulation.
Interestingly, cholesterol levels also show correlations with various anthropometric measures, albeit to a lesser degree. The observed associations could indicate underlying metabolic processes linking lipid metabolism, body composition, and cardiovascular risk.
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique commonly used in data analysis and machine learning. Its primary purpose is to reduce the number of features (variables) in a dataset while preserving as much of the original information as possible.
PCA achieves dimensionality reduction by transforming the original variables into a new set of uncorrelated variables, called Principal Components (PCs). These components are linear combinations of the original variables and are ranked by the amount of variance they explain in the data.
The idea is that a small number of these principal components can capture most of the variability present in the dataset.
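As a quick, self-contained illustration (on a toy random matrix, not our dataset), the PCs of standardized data come from the eigendecomposition of the correlation matrix, and `prcomp()` agrees with it:

```r
# Toy illustration: PCs from the eigendecomposition of the
# correlation matrix coincide with prcomp() on scaled data
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # induce correlation between columns

eig <- eigen(cor(X))
pca <- prcomp(X, scale. = TRUE)

# The eigenvalues equal the squared standard deviations of the PCs
all.equal(eig$values, pca$sdev^2)
```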
In this study, PCA will be used to reduce the dimensionality of the dataset, making visualization of classification boundaries feasible and enhancing our understanding of how the different classes (Diabetic, Non-Diabetic, Prediabetic) are distributed in the feature space.
#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Performing PCA on the standardized variables
pca_result <- prcomp(diabetes, scale. = TRUE)
# Scree Plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))
# Inspecting Loadings
pca_result$rotation
## PC1 PC2 PC3 PC4 PC5
## chol 0.12106437 0.32852784 0.09540116 -0.261180432 0.26120018
## stab.glu 0.20339463 0.35474272 -0.21294936 -0.093844621 -0.43740303
## hdl -0.20747129 0.05412221 0.44894797 0.005316485 -0.42698331
## ratio 0.25792699 0.20769091 -0.35620251 -0.139600037 0.53159280
## glyhb 0.21178246 0.39656574 -0.19119272 -0.134487043 -0.36680966
## age 0.10231727 0.43875964 0.12847883 -0.010736354 -0.07787485
## height 0.06017702 -0.01502318 -0.19320881 0.680602238 -0.03807773
## weight 0.42144152 -0.26130439 0.03669510 0.074855796 -0.07904174
## waist 0.43972728 -0.12740608 0.07748852 0.059980001 -0.08837684
## hip 0.38264738 -0.26267308 0.15638117 -0.273762712 -0.08644486
## BMI 0.42168038 -0.26818929 0.08148923 -0.067661076 -0.07408503
## waist_to_hip 0.20680716 0.17428340 -0.10248219 0.529440313 -0.02414247
## bp.s 0.14444393 0.31440355 0.46903073 0.114803693 0.18212406
## bp.d 0.12892566 0.12451102 0.50772687 0.199569853 0.26965213
## PC6 PC7 PC8 PC9 PC10
## chol 0.705321921 0.12688528 -0.017032969 -0.056273090 0.056140413
## stab.glu -0.056658388 -0.31109247 0.137363186 -0.049290548 0.679964826
## hdl 0.459112314 0.13362516 0.006951515 -0.060631175 0.010991480
## ratio 0.095505846 -0.03331440 -0.022922154 -0.020791684 0.039042914
## glyhb -0.001285710 -0.25503988 0.136037520 0.048012644 -0.720018943
## age -0.271081058 0.42748164 -0.573313561 0.424678793 0.055398342
## height 0.313971199 -0.32091486 -0.481607058 -0.014277984 -0.043556356
## weight 0.118780957 -0.05278748 -0.123655082 0.011959536 0.034773355
## waist -0.019996696 0.25303886 0.121538380 -0.043747817 -0.037528189
## hip -0.031172534 -0.02854156 -0.170078493 0.002869109 -0.050118663
## BMI 0.055359056 0.01226860 -0.036546153 0.021796688 0.040238278
## waist_to_hip 0.004124864 0.50690277 0.471953518 -0.057231493 0.005012728
## bp.s -0.295429436 -0.14433118 -0.131834949 -0.694267688 -0.036110116
## bp.d -0.038731416 -0.41451369 0.317304949 0.565341401 0.037146983
## PC11 PC12 PC13 PC14
## chol 0.03779151 0.4585338359 -0.0013531162 -1.496033e-04
## stab.glu 0.05820936 0.0163813461 -0.0034398400 -2.236330e-03
## hdl -0.03446568 -0.5800369281 -0.0035681033 7.909653e-04
## ratio -0.01692908 -0.6696090838 -0.0012038277 -1.092763e-03
## glyhb -0.06575122 0.0091542984 0.0005846927 -1.000524e-06
## age -0.07691537 -0.0007675883 0.0088399172 2.212614e-03
## height 0.19654632 0.0196292506 -0.0347232709 -1.353586e-01
## weight -0.45867671 0.0190344508 0.1758731269 6.824173e-01
## waist 0.44465544 -0.0266453685 0.6806984297 -1.693929e-01
## hip 0.53513029 -0.0417907359 -0.5754624745 1.587732e-01
## BMI -0.49298239 0.0152660734 -0.1732957924 -6.735072e-01
## waist_to_hip -0.01990592 0.0210048243 -0.3783662012 9.196216e-02
## bp.s -0.06982520 0.0150908588 -0.0064336114 -8.540078e-04
## bp.d 0.04548282 -0.0216511145 0.0039231119 6.319832e-04
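The percentages on the scree plot come from the proportion of variance explained (PVE), which can be computed directly from the PC standard deviations. The sketch below uses a toy matrix; the same two lines applied to `pca_result` should reproduce the plot's labels:

```r
# Proportion of variance explained, from the PC standard deviations
# (toy matrix here; for our data substitute pca_result)
set.seed(42)
pca <- prcomp(matrix(rnorm(200 * 5), ncol = 5), scale. = TRUE)

pve <- pca$sdev^2 / sum(pca$sdev^2)   # each PC's share of total variance
cumsum(pve)                           # cumulative share, as on a scree plot

sum(pve)                              # the shares always sum to one
```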
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Extracting the loadings (arrows)
loadings <- as.data.frame(pca_result$rotation[, 1:3])
pca_scores <- as.data.frame(pca_result$x[, 1:3])
# Scale the loadings for better visualization
loadings$PC1 <- loadings$PC1 * max(abs(pca_scores$PC1))
loadings$PC2 <- loadings$PC2 * max(abs(pca_scores$PC2))
loadings$PC3 <- loadings$PC3 * max(abs(pca_scores$PC3))
# Create the 3D scatter plot with PCA scores
fig <- plot_ly() %>%
add_trace(
data = pca_scores,
x = ~PC1,
y = ~PC2,
z = ~PC3,
type = 'scatter3d',
mode = 'markers',
marker = list(size = 3, color = 'orange'),
name = 'Data Points'
)
# Adding loadings as arrows, one by one
for (i in 1:nrow(loadings)) {
fig <- fig %>%
add_trace(
x = c(0, loadings$PC1[i]),
y = c(0, loadings$PC2[i]),
z = c(0, loadings$PC3[i]),
type = 'scatter3d',
mode = 'lines+text',
line = list(color = 'steelblue', width = 4),
text = rownames(loadings)[i],
textposition = 'top center',
name = paste("Loading", rownames(loadings)[i])
)
}
# Layout
fig <- fig %>%
layout(
title = "3D PCA Plot with Loadings",
scene = list(
xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')
)
)
fig
This is a biplot, which shows the projection of the original variables (arrows) and data points (orange dots) in the space defined by the first three principal components. The length and direction of each arrow indicate how strongly each original variable contributes to the PCs and their correlations. - Variables pointing in the same direction are positively correlated. - Variables pointing in opposite directions are negatively correlated. - Longer arrows indicate stronger contributions to the PCs.
The loadings table printed above shows the contribution of each original variable to the principal components. High absolute loadings indicate a strong relationship between a variable and the corresponding component.
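A convenient way to read such a table is to rank variables by the magnitude of their loading on a given component. Below is a small helper, illustrated on a toy PCA (for our data, pass `pca_result$rotation` instead):

```r
# Rank variables by absolute loading on one PC, keeping the sign
# for interpretation (toy example; use pca_result$rotation for our data)
set.seed(7)
pca <- prcomp(matrix(rnorm(150 * 4), ncol = 4,
                     dimnames = list(NULL, c("a", "b", "c", "d"))),
              scale. = TRUE)

top_loadings <- function(rotation, pc = 1, n = 3) {
  head(rotation[order(abs(rotation[, pc]), decreasing = TRUE), pc], n)
}
top_loadings(pca$rotation, pc = 1)
```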
In this section, we categorize the dataset into three classes: Non-Diabetic, Prediabetic, and Diabetic, based on the key variable stab.glu (stabilized glucose). After categorization, we use the PCA results to reduce the dimensionality of the data, making visualization and further analysis more efficient.
NOTE: Having an actual diagnosis of diabetes would be ideal, but since this is missing, we will create the labels based solely on stab.glu so that we can proceed with a simple machine learning analysis.
Additionally, we employ K-means clustering to identify natural groupings within the data and compare these clusters with the labeled categories. This helps us gain insights into how well the clustering algorithm identifies patterns that correspond to the diabetes status categories.
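The cutoffs used below follow common fasting-glucose thresholds (up to 99 Non-Diabetic, 100-125 Prediabetic, 126 and above Diabetic). A quick sketch with hypothetical glucose values shows how `cut()` assigns the labels; note that the intervals are right-closed by default, so each break value belongs to the interval below it:

```r
# Hypothetical glucose values (illustrative, not from our dataset)
glucose <- c(85, 99, 100, 125, 126, 210)
status <- cut(glucose,
              breaks = c(-Inf, 99, 125, Inf),
              labels = c("Non-Diabetic", "Prediabetic", "Diabetic"))
table(status)
```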
# Categorize stab.glu into Non-Diabetic, Prediabetic, and Diabetic
diabetes$DiabetesStatus <- cut(diabetes$stab.glu,
breaks = c(-Inf, 99, 125, Inf),
labels = c("Non-Diabetic", "Prediabetic", "Diabetic"))
# Extracting the relevant columns (standardizing them)
data_scaled <- scale(diabetes[, c('chol', 'stab.glu', 'hdl', 'ratio', 'glyhb', 'age', 'height', 'weight', 'waist', 'hip', 'BMI', 'waist_to_hip', 'bp.s', 'bp.d')])
# PCA
pca_res <- prcomp(data_scaled, scale. = TRUE)
pca_data <- as.data.frame(pca_res$x)
pca_data$DiabetesStatus <- diabetes$DiabetesStatus
# K-means clustering with 3 clusters
set.seed(123)
kmeans_res <- kmeans(pca_data[, c('PC1', 'PC2', 'PC3')], centers = 3)
pca_data$Cluster <- as.factor(kmeans_res$cluster)
# Interactive 3D Plot with Plotly (Diabetes Status)
plot1 <- plot_ly(pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~DiabetesStatus, colors = c('red', 'green', 'blue')) %>%
add_markers() %>%
layout(title = "3D PCA Plot with Diabetes Status",
scene = list(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')))
# Interactive 3D Plot with Plotly (K-means Clustering)
plot2 <- plot_ly(pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = c('red', 'green', 'blue')) %>%
add_markers() %>%
layout(title = "3D K-means Clustering on PCA-Reduced Data",
scene = list(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')))
plot1
plot2
PCA Scatter Plot with Diabetes Status
- The plot shows the data points reduced to the first three principal components (PC1, PC2, and PC3), colored by diabetes status.
- There is some overlap between groups, especially between the Non-Diabetic and Prediabetic points. However, the Diabetic group (blue) appears more separated and tends to occupy one side of the plot, which suggests some level of distinguishability.
K-means Clustering on PCA-Reduced Data
- This scatter plot shows the clusters formed by K-means clustering on the reduced data.
- The clustering seems to capture some meaningful structure:
  - The blue cluster (Cluster 3) appears to align somewhat with the Diabetic points from the previous plot.
  - The green cluster (Cluster 2) seems to represent the middle ground, likely containing mostly Prediabetic points.
  - The red cluster (Cluster 1) seems to represent the Non-Diabetic points, although there is still some overlap.
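One way to quantify the visual agreement described above is a contingency table of cluster assignments against the known labels, along with cluster purity (the fraction of points falling in each cluster's majority class). The sketch below uses small hypothetical vectors; the same code applies to `pca_data$Cluster` and `pca_data$DiabetesStatus`:

```r
# Hypothetical cluster assignments and labels (illustrative only)
labels   <- factor(c("Non-Diabetic", "Non-Diabetic", "Prediabetic",
                     "Prediabetic", "Diabetic", "Diabetic"))
clusters <- factor(c(1, 1, 2, 2, 3, 1))

# Cross-tabulate clusters against labels
tab <- table(Cluster = clusters, Status = labels)
tab

# Purity: proportion of points in each cluster's majority class
sum(apply(tab, 1, max)) / sum(tab)
```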
Lastly, we will train 3 models on the data and evaluate how well they perform when classifying new observations as Diabetic, Prediabetic, or Non-Diabetic.
k-Nearest Neighbors (kNN):
kNN is a simple, instance-based learning algorithm used for classification and regression. It works by finding the ‘k’ most similar data points (neighbors) to a given input and making predictions based on their labels.
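To make the idea concrete, here is a minimal from-scratch kNN classifier in base R, applied to a toy 2-D dataset (an illustrative sketch, not the caret implementation used below):

```r
# Minimal kNN: predict the majority label among the k nearest
# training points (Euclidean distance, ties broken by first match)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x),
                                      ncol(train_x), byrow = TRUE))^2))
  votes <- train_y[order(d)[1:k]]
  names(which.max(table(votes)))
}

# Two well-separated toy clusters
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("A", "A", "A", "B", "B", "B")
knn_predict(train_x, train_y, c(0.2, 0.2))  # near the "A" cluster
```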
Support Vector Machine (SVM):
SVM is a powerful supervised learning algorithm used for classification and regression. It works by finding the optimal hyperplane that best separates data points of different classes. By using kernels, it can model complex, non-linear boundaries. SVMs are robust to high-dimensional data but can be sensitive to noise.
Decision Tree:
Decision Trees are a flowchart-like model used for classification and regression. They split data into subsets based on the most informative features, making decisions at each node. They are easy to visualize and interpret but prone to overfitting if not properly pruned.
library(ggplot2)
library(dplyr)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(factoextra)
library(cluster)
library(plotly)
library(caret)
## Loading required package: lattice
# Split data into training and testing
set.seed(123)
train_index <- createDataPartition(pca_data$DiabetesStatus, p = 0.7, list = FALSE)
train_data <- pca_data[train_index, ]
test_data <- pca_data[-train_index, ]
# Define the models to train
models <- list(
SVM = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'svmRadial'),
KNN = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'knn', tuneLength = 5),
DecisionTree = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'rpart')
)
# Predict on test data & calculate performance metrics
results <- lapply(models, function(model) {
predictions <- predict(model, test_data)
cm <- confusionMatrix(predictions, test_data$DiabetesStatus)
data.frame(Accuracy = cm$overall['Accuracy'])
})
# Combine results into a single data frame for comparison
results_df <- do.call(rbind, results)
results_df <- cbind(Model = names(models), results_df)
print(results_df)
## Model Accuracy
## SVM SVM 0.7321429
## KNN KNN 0.7678571
## DecisionTree DecisionTree 0.7232143
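When classes are imbalanced, overall accuracy can hide poor performance on a minority class. The sketch below uses hypothetical confusion-matrix counts to show how per-class recall is derived in base R; caret's `confusionMatrix()` reports the same quantity as per-class Sensitivity:

```r
# Hypothetical confusion matrix (rows = predicted, cols = actual);
# the counts are illustrative, not our models' results
cm <- matrix(c(70,  5, 2,
                8, 10, 4,
                2,  3, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("Non-Diabetic", "Prediabetic", "Diabetic"),
                             Actual    = c("Non-Diabetic", "Prediabetic", "Diabetic")))

# Recall (sensitivity): correct predictions per actual class
recall <- diag(cm) / colSums(cm)
round(recall, 3)

# Overall accuracy: trace over total count
accuracy <- sum(diag(cm)) / sum(cm)
```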
# Plotly 3D Scatter Plot with Decision Boundaries (KNN Example)
grid <- expand.grid(PC1 = seq(min(pca_data$PC1), max(pca_data$PC1), length.out = 50),
PC2 = seq(min(pca_data$PC2), max(pca_data$PC2), length.out = 50),
PC3 = seq(min(pca_data$PC3), max(pca_data$PC3), length.out = 50))
predictions <- predict(models$KNN, grid)
grid$Prediction <- predictions
plot_ly() %>%
add_markers(data = pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~DiabetesStatus, colors = c('red', 'green', 'blue'),
marker = list(size = 3)) %>%
add_markers(data = grid, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Prediction, opacity = 0.2, marker = list(size = 2)) %>%
layout(title = '3D PCA Plot with Decision Boundaries',
scene = list(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')))